# Fine-Tuning Assistant

The Fine-Tuning Assistant skill guides you through adapting pre-trained models to your specific use case. Fine-tuning can dramatically improve model performance on specialized tasks, teach models your preferred style, and add capabilities that prompting alone cannot achieve.

This skill covers when to fine-tune versus prompt-engineer, preparing training data, selecting base models, configuring training parameters, evaluating results, and deploying fine-tuned models. It applies modern techniques, including LoRA, QLoRA, and instruction tuning, to make fine-tuning practical and cost-effective.

Whether you are fine-tuning GPT models via API, running local training with open-source models, or using platforms like Hugging Face, this skill ensures you approach fine-tuning strategically and effectively.
## Core Workflows

### Workflow 1: Decide Whether to Fine-Tune

**Assess the problem:**

- Can prompting achieve the goal?
- Is the task format or style consistent?
- Do you have quality training data?
- Is this worth the investment?

**Compare approaches:**
| Approach | When to Use | Investment |
|---|---|---|
| Better prompts | First attempt, variable tasks | Low |
| Few-shot examples | Consistent format, limited data | Low |
| RAG | Knowledge-intensive, dynamic data | Medium |
| Fine-tuning | Consistent style, specialized task | High |
**Evaluate requirements:**

- Minimum 100-1000 quality examples
- Clear evaluation criteria
- Budget for training and hosting
**Decision:** Fine-tune only if prompting/RAG is insufficient.

### Workflow 2: Prepare Fine-Tuning Dataset

**Collect training examples:**

- Representative of the target use case
- High quality (no errors in outputs)
- Diverse coverage of task variations

**Format for training:**

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant..."},
  {"role": "user", "content": "User input here"},
  {"role": "assistant", "content": "Ideal response here"}
]}
```

**Quality assurance:**

- Review a sample of examples manually
- Check for consistency in style/format
- Remove duplicates and low-quality entries
- Split train/validation/test sets
- Validate dataset format

### Workflow 3: Execute Fine-Tuning

**Select base model:**

- Consider the size vs. capability tradeoff
- Match model to task complexity
- Check licensing for your use case

**Configure training:**
OpenAI fine-tuning (`training_config`):

```json
{
  "model": "gpt-4o-mini-2024-07-18",
  "training_file": "file-xxx",
  "hyperparameters": {
    "n_epochs": 3,
    "batch_size": "auto",
    "learning_rate_multiplier": "auto"
  }
}
```
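The `training_file` referenced above must be uploaded as JSONL in the chat format from Workflow 2. A minimal format check can catch malformed records before you pay for a training run; `validate_training_line` below is an illustrative helper, not part of any SDK:

```python
import json

def validate_training_line(line: str) -> bool:
    """Check one JSONL line against the chat fine-tuning format."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    # Every message needs a valid role and string content.
    roles = {"system", "user", "assistant"}
    if not all(isinstance(m, dict) and m.get("role") in roles
               and isinstance(m.get("content"), str) for m in messages):
        return False
    # At least one assistant turn must be present to learn from.
    return any(m["role"] == "assistant" for m in messages)

example = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
print(validate_training_line(example))  # True
```

Run this over every line of the file and reject the upload if any line fails.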
LoRA fine-tuning, local (`lora_config`):

```json
{
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "target_modules": ["q_proj", "v_proj"]
}
```

Here `r` is the rank of the update matrices.
**Monitor training:**

- Watch loss curves
- Check for overfitting
- Validate on the held-out set

**Evaluate results:**

- Compare to the baseline model
- Test on diverse inputs
- Check for regressions
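The loss-curve checks above reduce to a simple rule: stop when validation loss has stopped improving. A minimal sketch, assuming you log validation loss once per epoch (the helper names here are illustrative):

```python
def epochs_since_best(val_losses):
    """Epochs elapsed since the best (lowest) validation loss."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch

def should_stop(val_losses, patience=2):
    """Early-stopping check: stop once validation loss has not
    improved for `patience` consecutive epochs."""
    return epochs_since_best(val_losses) >= patience

val = [2.0, 1.7, 1.5, 1.6, 1.8]  # rises after epoch 3: likely overfitting
print(should_stop(val))  # True
```

If training loss keeps falling while this check fires, the model is memorizing the training set rather than generalizing.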
## Quick Reference

| Action | Command/Trigger |
|---|---|
| Decide approach | "Should I fine-tune for [task]" |
| Prepare data | "Format data for fine-tuning" |
| Choose model | "Which model to fine-tune for [task]" |
| Configure training | "Fine-tuning parameters for [goal]" |
| Evaluate results | "Evaluate fine-tuned model" |
| Debug training | "Fine-tuning loss not decreasing" |
## Best Practices

### Start with Prompting

- Fine-tuning is expensive; exhaust cheaper options first
- Can better prompts achieve 80% of the goal?
- Try few-shot examples in the prompt
- Consider RAG for knowledge tasks
### Quality Over Quantity

- 100 excellent examples beat 10,000 mediocre ones
- Each example should be a gold standard
- Have humans verify examples where possible
- Remove anything you wouldn't want the model to learn
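The dedupe-and-filter pass can be partially automated. A minimal sketch, assuming chat-format examples as in Workflow 2; the length heuristic is a placeholder for human review, not a substitute for it:

```python
import json

def clean_dataset(examples, min_response_chars=20):
    """Drop exact duplicates and trivially low-quality entries."""
    seen = set()
    cleaned = []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        assistant_turns = [m["content"] for m in ex["messages"]
                           if m["role"] == "assistant"]
        # Heuristic filter: empty or very short responses are
        # unlikely to be gold-standard examples.
        if assistant_turns and all(len(t.strip()) >= min_response_chars
                                   for t in assistant_turns):
            cleaned.append(ex)
    return cleaned
```

Anything this filter keeps should still go through manual review of a sample, per the quality-assurance steps in Workflow 2.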
### Match Format to Use Case

- Training examples should mirror real usage
- Same prompt structure as production
- Realistic input variations
- Cover edge cases explicitly
### Don't Over-Train

- More epochs isn't always better
- Watch validation loss for overfitting
- Start with 1-3 epochs
- Stop early when validation loss plateaus
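With the Hugging Face `Trainer`, "stop when validation plateaus" maps onto `EarlyStoppingCallback`. A configuration sketch only: `model`, `train_ds`, and `val_ds` are assumed to exist, and older transformers releases name the first strategy argument `evaluation_strategy` instead of `eval_strategy`:

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,               # start small; early stopping may end sooner
    eval_strategy="epoch",            # evaluate on the validation set each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the checkpoint with the best eval_loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
trainer.train()
```

`load_best_model_at_end` matters here: even if a late epoch overfits, the checkpoint with the lowest validation loss is what you keep.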
### Evaluate Properly

- Training loss isn't the goal
- Use a held-out test set
- Compare to the baseline on the same tests
- Check for capability regressions
- Test on edge cases explicitly
### Version Everything

- Fine-tuning is iterative
- Version your training data
- Track experiment configurations
- Document what worked and what didn't

## Advanced Techniques

### LoRA (Low-Rank Adaptation)

Efficient fine-tuning for large models:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                  # Rank of update matrices
    lora_alpha=32,         # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to base model
model = get_peft_model(base_model, lora_config)

# Only ~0.1% of parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

### QLoRA (Quantized LoRA)

Fine-tune large models on consumer hardware:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)

# Apply LoRA on top
model = get_peft_model(model, lora_config)
```

### Instruction Tuning Dataset Creation

Convert raw data to instruction format:

```python
def create_instruction_example(raw_data):
    return {
        "messages": [
            {"role": "system", "content": "You are a customer service agent for TechCorp..."},
            {"role": "user", "content": f"Customer inquiry: {raw_data['inquiry']}"},
            {"role": "assistant", "content": raw_data["ideal_response"]},
        ]
    }
```
```python
# Apply to dataset
instruction_dataset = [create_instruction_example(d) for d in raw_dataset]
```

### Evaluation Framework

Comprehensive assessment of fine-tuned models:

```python
import numpy as np

def evaluate_fine_tuned_model(model, test_set, baseline_model=None):
    results = {
        "task_accuracy": [],
        "format_compliance": [],
        "style_match": [],
        "regression_check": [],
    }
    for example in test_set:
        output = model.generate(example.input)

        # Task-specific accuracy
        results["task_accuracy"].append(check_correctness(output, example.expected))

        # Format compliance
        results["format_compliance"].append(matches_expected_format(output))

        # Style matching (for style transfer tasks)
        results["style_match"].append(style_similarity(output, example.expected))

        # Regression on general capabilities
        if baseline_model:
            results["regression_check"].append(
                compare_general_capability(model, baseline_model, example)
            )

    # Skip metrics with no entries (e.g. regression_check without a baseline)
    return {k: np.mean(v) for k, v in results.items() if v}
```

### Curriculum Learning

Order training data by difficulty:

```python
def create_curriculum(dataset):
    # Score examples by complexity
    scored = [(score_complexity(ex), ex) for ex in dataset]
    scored.sort(key=lambda x: x[0])

    # Create epochs with increasing difficulty
    n = len(scored)
    curriculum = {
        "epoch_1": [ex for _, ex in scored[:n // 3]],      # Easy
        "epoch_2": [ex for _, ex in scored[:2 * n // 3]],  # Easy + Medium
        "epoch_3": [ex for _, ex in scored],               # All
    }
    return curriculum
```

## Common Pitfalls to Avoid

- Fine-tuning when better prompting would suffice
- Using low-quality or inconsistent training examples
- Not holding out a proper test set
- Training for too many epochs (overfitting)
- Ignoring capability regressions from fine-tuning
- Not versioning training data and configurations
- Expecting fine-tuning to add factual knowledge (use RAG instead)
- Fine-tuning on data that doesn't match production use